K Nearest Neighbors

For this we will use a classified dataset. The feature column names are hidden, but we are given the data and the target classes.

We will use KNN to build a model that predicts the class of a new data point directly from its features.

Import Libraries


In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

Get the data


In [2]:
df = pd.read_csv('Classified Data',index_col=0)
# set index_col=0 to use the first column as the index

In [3]:
df.head()


Out[3]:
WTT PTI EQW SBI LQE QWG FDJ PJF HQE NXJ TARGET CLASS
0 0.913917 1.162073 0.567946 0.755464 0.780862 0.352608 0.759697 0.643798 0.879422 1.231409 1
1 0.635632 1.003722 0.535342 0.825645 0.924109 0.648450 0.675334 1.013546 0.621552 1.492702 0
2 0.721360 1.201493 0.921990 0.855595 1.526629 0.720781 1.626351 1.154483 0.957877 1.285597 0
3 1.234204 1.386726 0.653046 0.825624 1.142504 0.875128 1.409708 1.380003 1.522692 1.153093 1
4 1.279491 0.949750 0.627280 0.668976 1.232537 0.703727 1.115596 0.646691 1.463812 1.419167 1

Standardize the variables

Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale.
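
As a toy illustration of why scale matters (made-up numbers, not from this dataset): when one feature lives on a much larger scale than another, it dominates the Euclidean distance almost completely.


In [ ]:
import numpy as np

a = np.array([1000.0, 0.5])   # feature 1 is in the thousands, feature 2 near 1
b = np.array([1100.0, 0.9])
print(np.linalg.norm(a - b))  # ~100.0, driven almost entirely by feature 1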


In [16]:
from sklearn.preprocessing import StandardScaler

In [17]:
scaler = StandardScaler()

In [18]:
scaler.fit(df.drop('TARGET CLASS',axis=1))


Out[18]:
StandardScaler(copy=True, with_mean=True, with_std=True)

In [19]:
scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))
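
As a side note, scikit-learn's fit_transform combines the two steps above (fit, then transform) into a single call; this one-liner is equivalent:


In [ ]:
scaled_features = scaler.fit_transform(df.drop('TARGET CLASS',axis=1))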

In [20]:
# rebuild a DataFrame of the scaled features, reusing all column names except the target
df_feat = pd.DataFrame(scaled_features,columns=df.columns[:-1])
df_feat.head()


Out[20]:
WTT PTI EQW SBI LQE QWG FDJ PJF HQE NXJ
0 -0.123542 0.185907 -0.913431 0.319629 -1.033637 -2.308375 -0.798951 -1.482368 -0.949719 -0.643314
1 -1.084836 -0.430348 -1.025313 0.625388 -0.444847 -1.152706 -1.129797 -0.202240 -1.828051 0.636759
2 -0.788702 0.339318 0.301511 0.755873 2.031693 -0.870156 2.599818 0.285707 -0.682494 -0.377850
3 0.982841 1.060193 -0.621399 0.625299 0.452820 -0.267220 1.750208 1.066491 1.241325 -1.026987
4 1.139275 -0.640392 -0.709819 -0.057175 0.822886 -0.936773 0.596782 -1.472352 1.040772 0.276510

Train Test Split


In [21]:
from sklearn.model_selection import train_test_split

In [22]:
X_train, X_test, y_train, y_test = train_test_split(scaled_features,df['TARGET CLASS'],
                                                    test_size=0.30)
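
Note that the split above is unseeded, so the exact counts in the outputs below will vary from run to run. Optionally, passing random_state (any fixed integer; 101 here is arbitrary) makes the split reproducible, and stratify preserves the 0/1 class balance in both halves:


In [ ]:
X_train, X_test, y_train, y_test = train_test_split(
    scaled_features, df['TARGET CLASS'], test_size=0.30,
    random_state=101,               # fixed seed, for reproducibility
    stratify=df['TARGET CLASS'])    # keep the class balance in both splits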

Using KNN

We'll start with k=1.


In [23]:
from sklearn.neighbors import KNeighborsClassifier

In [24]:
knn = KNeighborsClassifier(n_neighbors=1)

In [25]:
knn.fit(X_train,y_train)


Out[25]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

In [27]:
pred = knn.predict(X_test)

Predictions and Evaluations


In [28]:
from sklearn.metrics import classification_report,confusion_matrix

In [29]:
print(confusion_matrix(y_test,pred))


[[134  21]
 [ 15 130]]
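
In scikit-learn's confusion matrix the rows are the actual classes and the columns the predicted classes, so 134 + 130 = 264 of the 300 test points were classified correctly and 21 + 15 = 36 were not. (Your exact counts may differ, since the split was unseeded.)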

In [30]:
print(classification_report(y_test,pred))


             precision    recall  f1-score   support

          0       0.90      0.86      0.88       155
          1       0.86      0.90      0.88       145

avg / total       0.88      0.88      0.88       300

Choosing a K Value

Let's use the elbow method to pick a good K value:


In [31]:
error_rate = []

# train a KNN model for each K from 1 to 29 and record its test error rate
for i in range(1,30):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
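
The loop above scores each K on a single train/test split, so the resulting curve can be noisy. A sketch of a more robust variant, using 5-fold cross-validation (cross_val_score) over the full scaled dataset instead of one split:


In [ ]:
from sklearn.model_selection import cross_val_score

cv_error = []
for i in range(1,30):
    knn = KNeighborsClassifier(n_neighbors=i)
    # mean accuracy across 5 folds; error rate = 1 - accuracy
    scores = cross_val_score(knn, scaled_features, df['TARGET CLASS'], cv=5)
    cv_error.append(1 - scores.mean())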

In [38]:
plt.figure(figsize=(10,6))
plt.plot(range(1,30),error_rate,color='blue', linestyle='-', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')


Out[38]:
[Figure: 'Error Rate vs. K Value' line plot, x-axis K (1 to 29), y-axis error rate]

In [39]:
# FIRST A QUICK COMPARISON TO OUR ORIGINAL K=1
knn = KNeighborsClassifier(n_neighbors=1)

knn.fit(X_train,y_train)
pred = knn.predict(X_test)

print('WITH K=1')
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))


WITH K=1


[[134  21]
 [ 15 130]]


             precision    recall  f1-score   support

          0       0.90      0.86      0.88       155
          1       0.86      0.90      0.88       145

avg / total       0.88      0.88      0.88       300


In [40]:
# NOW WITH K=16, CHOSEN FROM THE ELBOW PLOT ABOVE WHERE THE ERROR RATE LEVELS OFF
knn = KNeighborsClassifier(n_neighbors=16)

knn.fit(X_train,y_train)
pred = knn.predict(X_test)

print('WITH K=16')
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))


WITH K=16


[[139  16]
 [  5 140]]


             precision    recall  f1-score   support

          0       0.97      0.90      0.93       155
          1       0.90      0.97      0.93       145

avg / total       0.93      0.93      0.93       300
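
With K=16 the overall accuracy improves from 0.88 to 0.93. As a final aside, this kind of K search can also be automated with scikit-learn's GridSearchCV, which cross-validates every candidate value and refits the best one; a minimal sketch (cv=5 is an arbitrary choice):


In [ ]:
from sklearn.model_selection import GridSearchCV

# try every K from 1 to 29, scored by 5-fold cross-validated accuracy
param_grid = {'n_neighbors': list(range(1,30))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(scaled_features, df['TARGET CLASS'])
print(grid.best_params_)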

The End!

